Lab 4: Intro to Machine Learning

Practice session covering topics discussed in Lecture 4

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 27, 2024

GOAL OF TODAY’S PRACTICE SESSION

  • Review the basic questions we can ask about ASSOCIATION between any two variables:
    • does it exist?
    • how strong is it?
    • what is its direction?
  • Introduce a widely used analytical tool: REGRESSION



The examples and code in this lab session very closely follow …:

Topics discussed in Lecture # 4

Lecture 4: topics

  • Shifting the emphasis to empirical prediction
    • Distinction between supervised & unsupervised algorithms
      • Unsupervised ML Example
        • PCA
  • Useful R resources for metabolomics
    • Introduction to MetaboAnalyst software
  • Elements of statistical power analysis

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We will also use the packages below (specifying package::function for clarity).
# Load them for this R session

# --- General 
library(fs)      # file/directory interactions
library(here)    # tools to find your project's files, relative to the project root
library(paint)   # paint data.frame summaries in colour
library(janitor) # tools for examining and cleaning data
library(dplyr)   # {tidyverse} tools for manipulating and summarizing tidy data 
library(forcats) # {tidyverse} tool for handling factors
library(openxlsx) # Read, Write and Edit xlsx Files
library(flextable) # Functions for Tabular Reporting

# --- Statistics
library(rstatix) # Pipe-Friendly Framework for Basic Statistical Tests
library(lmtest) # Testing Linear Regression Models
library(performance) # Assessment of Regression Models Performance 

# --- Tidymodels (meta package)
#library(tidymodels) # not installed on this machine
library(rsample) # General Resampling Infrastructure (train/test splits)
library(broom) # Convert Statistical Objects into Tidy Tibbles

# Plotting
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics

DATASETS for today

We will use examples (with adapted datasets) from real clinical studies, provided among the learning materials of the open access books:

Importing Dataset 1 (NMR output)

Name: … .
Documentation: …
Sampling details: …

# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
nmr <- read.csv(file = here::here("practice", "data_input", "04_datasets",
                                      "nmr_bins_test_data_PCA_PLSDA.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 
  • Adapt the here::here() call to match your own folder structure

NMR output Variables and their description

nmr_desc <- tibble::tribble(
  ~Variable, ~Type, ~Description,
  "X",   "int", "xxxx",
  "...", "...", "..." #,
)

knitr::kable(nmr_desc)
Variable Type Description
X int xxxx
... ... ...

MACHINE LEARNING: A FOCUS ON PREDICTION

Introducing R (metapackage) tidymodels for modeling and ML

tidymodels is to modeling what the tidyverse is to data manipulation: an ecosystem of packages that enables a wide variety of modeling and statistical-analysis approaches while sharing the underlying design philosophy, grammar, and data structures of the tidyverse.

As with the tidyverse, we could install them all at once with install.packages("tidymodels"), or one by one (my preference), to appreciate what each package contributes to the process (Figure 1).

Introducing R (metapackage) tidymodels (cont.)

a layered approach!

Figure 1: Tidymodels ecosystem

Splitting the dataset into training and testing samples

  • 🚫 In ML, training a model on all of the available data at once is typically not a good choice.

  • 👍🏻 Instead, you can create subsets of your data that you use for different purposes, such as training your model and then testing your model.

    So, when you evaluate your model on data that it was not trained on, you get a better estimate of how it will perform on new data.
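The split described above can be sketched with rsample (loaded earlier). The built-in mtcars dataset stands in for the nmr data, and prop = 0.75 is an illustrative choice, not a rule:

```r
library(rsample)  # general resampling infrastructure

set.seed(123)  # make the split reproducible

# 75% of the rows go to training, the remaining 25% to testing
nmr_split <- rsample::initial_split(mtcars, prop = 0.75)
nmr_train <- rsample::training(nmr_split)
nmr_test  <- rsample::testing(nmr_split)

nrow(nmr_train)  # 24 of mtcars' 32 rows
nrow(nmr_test)   # the remaining 8 rows
```

Evaluating on nmr_test, which the model never saw during fitting, gives the honest performance estimate discussed above.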

_______

ML WITH UNSUPERVISED & SUPERVISED ALGORITHMS

PCA: step by step (example)

  1. PCA by hand: PCA step by step as in the Statology tutorial, but using the lecture dataset nmr_bins…csv

https://www.statology.org/principal-components-analysis-in-r/

The results will probably not match MetaboAnalyst's exactly, because MetaboAnalyst applies both normalization and scaling while the Statology tutorial only scales; that is fine, since it lets us see the difference.
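A minimal PCA "by hand" sketch with base R's stats::prcomp, following the Statology approach (centering and unit-variance scaling only). The built-in USArrests data stands in for the numeric bin columns of nmr_bins…csv:

```r
# PCA via singular value decomposition of the centered, scaled data
pca_fit <- stats::prcomp(USArrests,
                         center = TRUE,  # mean-center each variable
                         scale. = TRUE)  # scale each variable to unit variance

summary(pca_fit)        # proportion of variance explained by each component
head(pca_fit$x[, 1:2])  # sample scores on PC1 and PC2

# biplot(pca_fit)       # quick combined view of scores and loadings
```

With scaling, the variances of the components sum to the number of variables, which makes the "proportion of variance explained" easy to read off the summary.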

PLS-DA: step by step (example)

  1. PCA + PLS-DA + Cluster analysis https://rpubs.com/Anita_0736/PD_ANALYSIS

  2. PLS by hand: PLS step by step as in the Statology tutorial, but using the lecture dataset nmr_bins…csv

https://www.statology.org/partial-least-squares-in-r/

MetaboAnalyst uses PLS-DA, which is PLS with a categorical (class) response rather than a continuous one; it may also be nice to see the difference.
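A minimal PLS sketch with the pls package used in the Statology tutorial (it may need install.packages("pls") first). The response and predictors from mtcars are illustrative stand-ins for the nmr bins; PLS-DA would replace the continuous response with a numerically coded class membership:

```r
library(pls)  # partial least squares regression

set.seed(1)
# Continuous-response PLS with cross-validation to choose the
# number of components (scale = TRUE standardizes the predictors)
pls_fit <- pls::plsr(hp ~ mpg + disp + wt + drat, data = mtcars,
                     scale = TRUE, validation = "CV")

summary(pls_fit)  # variance explained and cross-validated RMSEP per component
```

The summary's RMSEP table is the usual basis for picking how many latent components to keep.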

_______

ML WITH UNSUPERVISED ALGORITHMS

Hierarchical Clustering (example)

  1. Hierarchical clustering by hand, as in the Statology tutorial, but using the lecture dataset nmr_bins…csv

https://www.statology.org/hierarchical-clustering-in-r/

If time runs short or this does not work out, the fallback is to have participants experiment with MetaboAnalyst during the exercises as well, hoping that both the network and the platform hold up.
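The hands-on version needs only base R, following the Statology tutorial. USArrests stands in for the scaled nmr bins, and Ward's linkage with k = 4 clusters is an illustrative choice:

```r
# Pairwise Euclidean distances on standardized variables
d  <- stats::dist(scale(USArrests), method = "euclidean")

# Agglomerative clustering with Ward's linkage
hc <- stats::hclust(d, method = "ward.D2")

plot(hc, cex = 0.6)                 # dendrogram
groups <- stats::cutree(hc, k = 4)  # cut the tree into 4 clusters
table(groups)                       # cluster sizes
```

Scaling before computing distances matters: without it, variables on larger scales dominate the distance matrix and hence the clustering.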

_______

SAMPLE SIZE… 🙀 a.k.a. “the $1,000,000 question”!

_______

Final thoughts/recommendations

  • The analyses proposed in this Lab are very similar to the process we go through in real life. The following steps are always included:

    • Thorough understanding of the input data and the data collection process
    • Bivariate analysis of correlation / association to form an intuition of which explanatory variable(s) may or may not affect the response variable
    • Diagnostic plots to verify if the necessary assumptions are met for a linear model to be suitable
    • Upon verifying the assumptions, we fit the data to the hypothesized (linear) model
    • Assessment of the model’s performance (\(R^2\), adjusted \(R^2\), \(F\)-statistic, etc.)
  • As we saw with hypothesis testing, the assumptions we make (and require) for regression are of the utmost importance

  • Clearly, we only scratched the surface of all the possible predictive models, but we got the hang of the fundamental steps and picked up some useful tools that will also serve us in more advanced analyses

    • e.g. broom (within tidymodels), performance, rstatix, lmtest
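As a closing sketch, broom turns the model-assessment step above into tidy tables. The linear model on mtcars is an illustrative stand-in for a real analysis:

```r
library(broom)  # convert statistical objects into tidy tibbles

# Illustrative linear model on stand-in data
lm_fit <- lm(mpg ~ wt + hp, data = mtcars)

broom::glance(lm_fit)  # one-row tibble: r.squared, adj.r.squared, F statistic, p-value, ...
broom::tidy(lm_fit)    # coefficient estimates, std. errors, t statistics
```

Because glance() and tidy() return tibbles, the performance measures listed above slot directly into dplyr pipelines, e.g. for comparing several candidate models.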